General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This, however, prevents them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum, our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, 3) demonstrating competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40 and Kinetics datasets with the exact same, unchanged model and without specialized preprocessing or any tokenization.
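As a rough illustration of the locality idea described above, the sketch below (plain NumPy, our own simplification rather than the authors' architecture) partitions a flattened input into groups and lets a small set of learned latents cross-attend within each group, so the cost grows with the number of groups instead of quadratically with the full input length. All names, shapes and group sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, tokens):
    # latents: (L, D), tokens: (N, D); single head, no learned projections,
    # purely to show the access pattern.
    scores = latents @ tokens.T / np.sqrt(latents.shape[-1])
    return softmax(scores) @ tokens

def hip_block(inputs, group_latents, num_groups):
    # inputs: (N, D) flattened raw signal (e.g. pixels plus positional embeddings).
    groups = np.split(inputs, num_groups, axis=0)            # local chunks
    outputs = [cross_attention(group_latents, g) for g in groups]
    return np.concatenate(outputs, axis=0)                   # (num_groups * L, D)

# Toy example: 65,536 input tokens, 16 groups of 4,096, 32 latents per group.
x = np.random.randn(16 * 4096, 64).astype(np.float32)
z = np.random.randn(32, 64).astype(np.float32)
print(hip_block(x, z, num_groups=16).shape)                  # (512, 64)
```

Stacking such blocks, each attending locally and then merging the compressed groups, is what makes the hierarchical variant cheaper than a single global attention over the raw signal.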
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
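To make the flow-assisted annotation idea concrete, here is a minimal sketch (our assumption of the mechanics, not the paper's pipeline) of propagating an annotated point through a video by chaining per-frame optical flow; a real pipeline would interpolate flow sub-pixel, estimate occlusion, and let annotators correct the result on the hard segments.

```python
import numpy as np

def propagate_point(flows, start_xy):
    """flows: list of (H, W, 2) arrays mapping frame t to t+1; start_xy: (x, y) in frame 0."""
    h, w = flows[0].shape[:2]
    track = [np.asarray(start_xy, dtype=np.float32)]
    for flow in flows:
        x, y = track[-1]
        # Nearest-neighbour lookup; a real pipeline would sample the flow
        # bilinearly and also flag frames where the point becomes occluded.
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        track.append(track[-1] + flow[yi, xi])
    return np.stack(track)          # (T + 1, 2) point track

# Toy example: constant rightward motion of 2 px/frame over 10 frames.
flows = [np.full((48, 64, 2), (2.0, 0.0), dtype=np.float32) for _ in range(10)]
print(propagate_point(flows, (10.0, 20.0))[-1])   # ~[30. 20.]
```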
Estimating depth from endoscopic images is a prerequisite for a wide range of AI-assisted technologies, such as accurate localization, tumor measurement, or identification of unexplored areas. As the domain specificity of colonoscopies, a deformable, low-texture environment with fluids, poor lighting conditions and abrupt sensor motions, poses challenges to multi-view approaches, single-view depth learning stands out as a promising line of research. In this paper, we explore for the first time Bayesian deep networks for single-view depth estimation in colonoscopies. Their uncertainty quantification offers great potential for such a critical application area. Our specific contributions are twofold: 1) an exhaustive analysis of Bayesian deep networks for depth estimation on three different datasets, highlighting challenges and conclusions regarding synthetic-to-real domain changes and supervised versus self-supervised methods; and 2) a novel teacher-student approach to deep depth learning that takes the teacher's uncertainty into account.
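A minimal sketch of how a teacher-student depth objective might take the teacher's uncertainty into account, down-weighting pixels where the Bayesian teacher's predictive variance is high. The exact weighting used in the paper may differ; the formulation below is only an assumption for illustration.

```python
import numpy as np

def uncertainty_weighted_depth_loss(student_depth, teacher_depth, teacher_var, eps=1e-6):
    """All inputs are (H, W) arrays; teacher_var is the per-pixel predictive
    variance, e.g. from stochastic forward passes of the Bayesian teacher."""
    weights = 1.0 / (teacher_var + eps)        # confident pixels weigh more
    weights = weights / weights.mean()         # keep the loss scale stable
    return np.mean(weights * np.abs(student_depth - teacher_depth))

def teacher_statistics(mc_samples):
    """mc_samples: (T, H, W) depth maps from T stochastic teacher passes."""
    return mc_samples.mean(axis=0), mc_samples.var(axis=0)
```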
We address the problem of multi-person 3D body pose and shape estimation from a single image. While this problem can be tackled by applying single-person approaches multiple times to the same scene, recent works have shown the advantages of building deep architectures that reason about all people in the scene in a holistic manner, for example by enforcing depth-order constraints or by minimizing interpenetration between reconstructed bodies. However, existing approaches are still unable to capture the size variability of people caused by the inherent body scale and depth ambiguity. In this work, we address this challenge by devising a novel optimization scheme that learns the appropriate body scales and relative camera poses by enforcing that the feet of all people remain on the ground. A thorough evaluation on the MuPoTS-3D and 3DPW datasets demonstrates that our approach robustly estimates the body translations and shapes of multiple people while retrieving their spatial arrangement, consistently improving on the current state of the art, especially in scenes with people of very different heights.
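The ground-plane constraint can be made concrete with a small sketch. The residual below (our own naming and formulation, not the paper's exact objective) measures how far each person's ankle joints lie from a shared ground plane as a function of per-person scale and translation; minimizing it jointly over all people is one way to resolve the scale/depth ambiguity the abstract refers to, and in practice it would be combined with reprojection terms.

```python
import numpy as np

def ground_plane_residual(ankles_cam, scales, translations, plane_normal, plane_d):
    """ankles_cam: (P, 2, 3) ankle joints per person in camera coordinates,
    up to an unknown per-person scale; scales: (P,); translations: (P, 3)."""
    residuals = []
    for p in range(len(scales)):
        feet = scales[p] * ankles_cam[p] + translations[p]   # place person p
        # signed distance of each foot to the plane n.x + d = 0
        residuals.append(feet @ plane_normal + plane_d)
    return np.concatenate(residuals)

def objective(params, ankles_cam, plane_normal, plane_d):
    p = ankles_cam.shape[0]
    scales = params[:p]
    translations = params[p:].reshape(p, 3)
    r = ground_plane_residual(ankles_cam, scales, translations, plane_normal, plane_d)
    return np.sum(r ** 2)

# e.g. two people with an assumed ground plane y = 0 in camera coordinates:
# objective(params, ankles_cam, plane_normal=np.array([0., 1., 0.]), plane_d=0.0)
```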
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network: a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of either video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available [1, 2, 3].
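As a hedged illustration of the deflation idea, the sketch below collapses a video convolution kernel over its temporal axis so the resulting 2-D filter can be applied directly to a static image, as if that image were repeated in time. Collapsing by summation is an assumption made here for illustration; the paper derives the exact procedure.

```python
import numpy as np

def deflate_kernel(kernel_3d):
    """kernel_3d: (T, kH, kW, C_in, C_out) video conv filter ->
    (kH, kW, C_in, C_out) image conv filter (temporal axis summed out)."""
    return kernel_3d.sum(axis=0)

# A (3, 3, 3) spatio-temporal kernel becomes a (3, 3) spatial kernel.
video_kernel = np.random.randn(3, 3, 3, 16, 32).astype(np.float32)
image_kernel = deflate_kernel(video_kernel)
print(image_kernel.shape)    # (3, 3, 16, 32)
```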